Objective: to build the basic knowledge needed to import, modify, write and plot data, and to write loops, using the R Project for Statistical Computing.
Contents:
To import data to R we can use the read.table function, which returns a data frame object.
read.table(file, header = FALSE, sep = "",
dec = ".", row.names, col.names, na.strings = "NA",
skip = 0, stringsAsFactors = FALSE, ...)
file: the name of the file which the data are going to be read from.
header: a logical value indicating whether the file contains the name of the variables as its first line.
sep: the field separator character. Values on each line of the file are separated by this character.
dec: the character used in the file for decimal points.
na.strings: a character vector of strings which are to be interpreted as NA values.
stringsAsFactors: [Logical] should character vectors be converted to factors?
Remember, if we omit one of these arguments, its default value will be used.
Let’s create a table in Excel
Save the table as a csv file:
We use the path of the file in the read.table function.
We could also use the read.csv function:
The read.table, read.csv and read.csv2 functions can all be used to import data. Note that read.csv and read.csv2 are wrappers around read.table with different default settings.
read.table   header = FALSE, sep = ""  (any whitespace)
read.csv     header = TRUE,  sep = ","
read.csv2    header = TRUE,  sep = ";", dec = ","
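As a minimal sketch of the import step, the snippet below writes a small table to a temporary CSV file and reads it back with both functions. The column names (Month, P, ETa), the values and the temporary file are assumptions made for illustration; they are not the table used in this course.

```r
# Hypothetical example data (not the course table): monthly precipitation (P)
# and actual evapotranspiration (ETa), both in mm.
csv.lines <- c("Month,P,ETa",
               "Jan,80.5,20.1",
               "Feb,65.2,25.4",
               "Mar,NA,40.3")

# Write the lines to a temporary file so the example is self-contained.
tmp.path <- tempfile(fileext = ".csv")
writeLines(csv.lines, tmp.path)

# read.table needs the header and separator spelled out for a CSV file...
tab1 <- read.table(tmp.path, header = TRUE, sep = ",")

# ...whereas read.csv already uses header = TRUE and sep = "," by default.
tab2 <- read.csv(tmp.path)

all.equal(tab1, tab2)  # compare the two imported data frames
dim(tab1)              # number of rows and columns
```

The string "NA" in the file is read as a missing value because it matches the default na.strings setting.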
dim: returns the number of rows and columns (i.e. the dimensions) of the object.
Now we can add a new column to the existing data frame. As an example, we will calculate the values of precipitation minus evapotranspiration:
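A possible sketch of this step is shown here. The data frame below is a hypothetical stand-in for the imported table, and the column names P, ETa and P_minus_ETa are assumptions.

```r
# Hypothetical stand-in for the imported table (values in mm).
p_eta_table <- data.frame(Month = c("Jan", "Feb", "Mar"),
                          P     = c(80.5, 65.2, 50.0),
                          ETa   = c(20.1, 25.4, 40.3))

# Create a new column as the element-wise difference of two existing columns.
p_eta_table$P_minus_ETa <- p_eta_table$P - p_eta_table$ETa

dim(p_eta_table)    # the table now has one more column
print(p_eta_table)
```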
Accessing data is the same for matrices and data frames. For example, to extract the value located in the second row and fourth column:
Or:
Also we can subset the first three values of the third row.
Or make a subset of the data frame.
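The indexing operations described above can be sketched as follows. The data frame df and its values are hypothetical and serve only to make the positions easy to follow; note that the $ notation works for data frames but not for matrices.

```r
# Hypothetical 3 x 4 data frame used only to illustrate indexing.
df <- data.frame(A = c(1, 2, 3),
                 B = c(10, 20, 30),
                 C = c(100, 200, 300),
                 D = c(1000, 2000, 3000))

# Value in the second row and fourth column, by position...
df[2, 4]
# ...or by column name (data frames only).
df$D[2]

# First three values of the third row (returns a one-row data frame).
df[3, 1:3]

# Subset of the data frame: rows 1 to 2, columns A and C.
df.sub <- df[1:2, c("A", "C")]
```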
For exporting data, we can use the write.table function, which writes a data frame to a file or connection.
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
na = "NA", dec = ".", row.names = TRUE, ...)
x: the object to be written. Preferably a matrix or data frame.
file: the file path where we want to save the file (including file name and extension).
append: [Logical] only relevant if the file is a character string. If TRUE, the output is appended to the file. If FALSE, any existing file of the name is destroyed.
quote: [Logical] if TRUE, any character or factor columns will be surrounded by double quotes.
sep: the field separator string. Values within each row of x are separated by this string.
na: the string to use for missing values in the data.
dec: the string to use for decimal points in numeric or complex columns; it must be a single character.
row.names and col.names: [Logical] indicating whether the names are to be written along with x, or a character vector of names to be written.
Now, we will save the table that we have edited.
write.table(p_eta_table, paste0(dirname(data.path), "/table_modified.csv"),
row.names = FALSE, sep = ",")
Now, you can go to the corresponding folder and check whether the file has been written.
Additionally, we can save a single R object to a file or connection with the saveRDS function and restore it later with the readRDS function.
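A minimal round trip with saveRDS and readRDS might look like this. The object name cities and the temporary file are assumptions for illustration.

```r
# Small example object to store.
cities <- data.frame(City = c("Madrid", "Roma"),
                     Population = c(3334730, 2808293))

# Save the single object to a temporary .rds file...
rds.path <- tempfile(fileext = ".rds")
saveRDS(cities, file = rds.path)

# ...and restore it later, possibly under a different name.
cities.restored <- readRDS(rds.path)
identical(cities, cities.restored)
```

Unlike write.table, this keeps the R object as it is (column types included) rather than converting it to text.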
We can use the rbind function to combine the rows from two separate tables into one table. For this example, we will load tables with the most populous cities from both Spain and Italy.
spain.pop <- read.csv("...add file path…/Cities/Spain.csv", header = TRUE)
italy.pop <- read.csv("...add file path…/Cities/Italy.csv", header = TRUE)
print(spain.pop)
## City Population
## 1 Madrid 3334730
## 2 Barcelona 1664182
## 3 València 800215
## 4 Sevilla 691395
## 5 Zaragoza 681877
## 6 Málaga 578460
## 7 Murcia 459403
## 8 Palma 422587
## 9 Las Palmas de Gran Canaria 381223
## 10 Bilbao 350184
First, let’s add information about the country to the table, so that the combined table will include the country for each entry.
spain.pop$Country <- rep("Spain", times = nrow(spain.pop))
italy.pop$Country <- rep("Italy", times = nrow(italy.pop))
print(spain.pop)
## City Population Country
## 1 Madrid 3334730 Spain
## 2 Barcelona 1664182 Spain
## 3 València 800215 Spain
## 4 Sevilla 691395 Spain
## 5 Zaragoza 681877 Spain
## 6 Málaga 578460 Spain
## 7 Murcia 459403 Spain
## 8 Palma 422587 Spain
## 9 Las Palmas de Gran Canaria 381223 Spain
## 10 Bilbao 350184 Spain
print(italy.pop)
## City Population Country
## 1 Roma 2808293 Italy
## 2 Milano 1406242 Italy
## 3 Napoli 948850 Italy
## 4 Torino 857910 Italy
## 5 Palermo 647422 Italy
## 6 Genova 565752 Italy
## 7 Bologna 395416 Italy
## 8 Firenze 366927 Italy
## 9 Bari 315284 Italy
Now we can use the rbind function to join the tables.
Similarly, the cbind function will combine objects by columns.
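Both functions can be sketched with shortened stand-ins for the two tables (the first two rows of each, typed in by hand so the example runs without the CSV files). The Rank column is a hypothetical addition used only to demonstrate cbind.

```r
# Shortened stand-ins for the two tables (first two rows of each).
spain.pop <- data.frame(City = c("Madrid", "Barcelona"),
                        Population = c(3334730, 1664182),
                        Country = "Spain")
italy.pop <- data.frame(City = c("Roma", "Milano"),
                        Population = c(2808293, 1406242),
                        Country = "Italy")

# rbind stacks the rows; the column names of both tables must match.
all.pop <- rbind(spain.pop, italy.pop)

# cbind places objects side by side; the numbers of rows must match.
# Rank (1 = most populous) is a hypothetical extra column.
all.pop.ranked <- cbind(all.pop, Rank = rank(-all.pop$Population))
print(all.pop.ranked)
```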
The plot function is a generic function. This means that the type of plot produced is dependent on the type or class of the first argument.
x: the coordinates of points in the plot (x-axis)
y: the coordinates of points in the plot (y-axis)
main: a title for the plot
sub: a subtitle for the plot
xlab and ylab: titles for the x and y axes
Note: y can be omitted if x has the appropriate structure (e.g., for a raster file)
type: what type of plot should be drawn. Possible types are:
To show the first set of plotting examples, we will read a csv file with World Bank data for annual mean cereal crop yield for both the world and for Germany.
When we have a lot of data entries stored in an object, we can use the head and tail functions to look at the first six and last six entries, respectively.
Note: the values presented are in kg/ha.
head(crop.yield)
## Year Global_Cereal_Yield German_Cereal_Yield
## 1 1961 1431.537 2417.4
## 2 1962 1523.116 2962.2
## 3 1963 1589.004 2925.2
## 4 1964 1589.813 3120.8
## 5 1965 1639.062 2852.2
## 6 1966 1680.538 2878.0
tail(crop.yield)
## Year Global_Cereal_Yield German_Cereal_Yield
## 52 2012 3619.562 6964.9
## 53 2013 3824.374 7318.0
## 54 2014 3892.360 8050.3
## 55 2015 3938.770 7497.8
## 56 2016 3967.029 7182.1
## 57 2017 4074.176 7269.9
Let’s start with a line chart for the global values.
Note: we will get the same result by selecting the column numbers instead of the names: i.e., plot(crop.yield[,1], crop.yield[,2], type = "l").
Now, we will redo the plot in a more customised manner (chart title, axes titles, user defined minimum and maximum axis values).
plot(crop.yield$Year, crop.yield$Global_Cereal_Yield, type = "l",
main = "Mean Cereal Yield", xlab = "Year",
ylab = "Mean yield (kg/ha)", ylim = c(0, 10000))
The lines function is used to add another line to an existing plot. The legend function is used to add a legend to the plot.
lines(crop.yield$Year, crop.yield$German_Cereal_Yield, col = "red")
legend("topleft", c("Global", "Germany"), lty = 1, col = c("black", "red"))
The points function is used to add points to an existing plot. pch selects the plotting character and col the plotting colour.
Example: we can also plot the information using a scatter plot.
Note: the default plot option is with points. Therefore, if we do not specify the type, we will have the same result.
A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable.
Example: the daily temperature values (in °C) for April in a particular city are:
The function hist will generate a histogram.
Again, we can customise the histogram.
hist(temp, main = "Histogram of temperature in April", xlab = "Temp [°C]",
border = "blue", col = "cyan", breaks = 8)
Remember to think carefully about how you plot and analyse your data. Looking at the previous example, we can plot the data with 5 breaks and with 8 breaks:
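The effect of the breaks argument can also be inspected without drawing anything. In the sketch below, temp is filled with the Apr column of the daily maximum temperature table printed later in this document, as a stand-in because the original April values are not listed here; that substitution is an assumption.

```r
# Apr column of the daily.temp table shown further below, used as a
# stand-in for temp (assumption: the original values are not available).
temp <- c(32.2, 22.8, 21.1, 27.2, 24.4, 19.4, 16.7, 14.4, 14.4, 20.0,
          22.2, 21.7, 18.9, 23.3, 30.0, 33.3, 27.8, 18.3, 20.6, 20.0,
          20.0, 20.6, 20.6, 20.0, 22.2, 22.8, 28.3, 28.3, 22.8, 26.7)

# plot = FALSE returns the binning without opening a graphics device.
h5 <- hist(temp, breaks = 5, plot = FALSE)
h8 <- hist(temp, breaks = 8, plot = FALSE)

# Bin limits and number of values per bin for each setting.
h5$breaks
h5$counts
h8$breaks
h8$counts
```

When breaks is a single number, R treats it as a suggestion and chooses rounded bin limits, so the number of bars drawn can differ from the number requested.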
The function boxplot produces box-and-whisker plot(s) of (grouped) values.
Now, we will load daily values for maximum temperature in a pilot city.
print(daily.temp)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 15.0 17.8 20.6 32.2 27.2 22.8 32.2 31.1 30.0 30.6 24.4 17.2
## 2 15.6 16.7 18.3 22.8 29.4 22.8 32.2 33.9 29.4 31.1 28.9 17.2
## 3 12.8 16.1 20.0 21.1 35.0 26.1 32.2 32.8 28.3 27.2 25.6 16.7
## 4 17.8 18.9 23.3 27.2 32.8 23.3 31.7 32.8 31.7 19.4 15.6 17.2
## 5 21.7 21.7 27.8 24.4 30.0 25.6 33.9 30.0 27.8 17.2 17.8 17.8
## 6 21.1 25.0 20.0 19.4 25.6 22.8 34.4 29.4 36.7 18.9 14.4 18.3
## 7 16.1 26.7 15.0 16.7 22.2 23.3 32.8 29.4 37.8 22.8 17.8 19.4
## 8 15.0 17.2 21.1 14.4 20.0 22.2 31.1 29.4 37.8 27.2 21.1 19.4
## 9 15.0 21.1 28.9 14.4 17.2 21.1 30.6 27.2 27.8 28.9 25.6 21.1
## 10 17.8 20.6 26.7 20.0 21.7 21.1 29.4 26.7 23.9 28.3 23.9 21.7
## 11 21.1 23.3 23.9 22.2 22.8 23.3 27.8 26.7 26.7 28.3 25.0 15.6
## 12 22.8 26.7 19.4 21.7 25.6 22.8 25.0 26.7 29.4 38.3 17.8 11.7
## 13 23.9 23.9 21.1 18.9 24.4 25.6 25.0 25.6 27.8 37.2 22.8 12.2
## 14 27.2 19.4 25.0 23.3 20.0 27.2 23.3 28.3 25.6 31.1 21.7 16.1
## 15 26.7 19.4 25.6 30.0 17.8 27.2 24.4 28.9 25.6 27.2 20.6 15.0
## 16 27.2 16.1 24.4 33.3 18.9 25.6 25.0 30.0 22.2 26.1 21.7 17.8
## 17 28.9 16.1 22.8 27.8 17.2 23.9 26.7 31.1 24.4 29.4 21.7 19.4
## 18 28.9 18.9 20.6 18.3 17.2 25.0 28.9 30.0 27.2 26.7 15.6 14.4
## 19 20.6 14.4 17.2 20.6 20.6 24.4 30.6 27.2 27.2 22.8 16.1 14.4
## 20 24.4 13.9 15.6 20.0 22.2 27.8 31.1 27.8 27.8 23.9 16.1 17.8
## 21 25.0 16.1 15.6 20.0 25.6 32.2 30.6 26.7 28.3 22.8 16.1 18.3
## 22 22.2 16.1 16.7 20.6 21.1 30.6 27.8 29.4 30.6 28.9 19.4 18.9
## 23 24.4 15.6 19.4 20.6 20.0 29.4 28.9 31.1 28.3 28.9 17.8 20.0
## 24 23.9 15.6 16.7 20.0 22.8 27.8 28.3 31.7 25.0 21.7 16.7 22.8
## 25 23.3 15.0 16.7 22.2 26.1 29.4 33.3 33.3 23.3 20.0 18.9 23.3
## 26 25.0 11.7 15.6 22.8 23.3 29.4 30.0 35.0 26.7 21.1 27.8 23.3
## 27 25.0 13.9 18.9 28.3 26.7 31.7 26.1 36.1 30.0 24.4 28.9 21.1
## 28 24.4 18.9 18.3 28.3 22.2 28.3 27.8 33.3 27.8 27.2 27.2 26.1
## 29 20.6 NA 23.3 22.8 17.8 26.7 34.4 35.0 27.8 28.3 26.7 25.6
## 30 15.0 NA 29.4 26.7 26.1 28.9 36.7 32.2 31.1 31.1 23.3 22.8
## 31 17.2 NA 32.2 NA 26.1 NA 29.4 30.0 NA 28.9 NA 25.0
To generate a boxplot with the data:
boxplot(daily.temp, main = "Daily Max. Temp of Pilot City in 2011",
col = "cyan", ylab = "Max temp (degrees C)")
The barplot function creates a bar chart with vertical or horizontal bars.
Let’s customise the bar chart.
barplot(crop.yield$Global_Cereal_Yield, main = "Global Cereal Yield",
sub = "Data from World Bank", xlab = "Year", ylab = "Yield (kg/ha)",
names.arg = crop.yield$Year)
Some other options to change how your data is presented in a bar chart:
col to add colour to the plot
horiz to create a horizontal bar chart (set to TRUE)
The function pie draws a pie chart.
slices <- c(10, 12, 4, 16, 8)
country <- c("Canada", "UK", "Australia", "Germany", "France")
pie(slices, labels = country)
More personalised:
R uses the following relational operators:
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== equal to
!= not equal to
For example:
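A short sketch of these operators, using hypothetical values for a, b and x:

```r
a <- 5
b <- 8

a < b    # TRUE
a >= b   # FALSE
a == 5   # TRUE
a != b   # TRUE

# Relational operators are vectorised: one result per element.
x <- c(2, 5, 9)
x > 4    # FALSE TRUE TRUE
```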
Control structures are very important because they allow us to do the following:
Only run certain code if a condition is met (if)
Run the same process a specified number of times (for)
Continue running a process for as long as a condition is met (while)
R is able to perform conditional executions of the form:
The condition must evaluate to a single logical value. Process 1 is run if the condition is met, and process 2 is run if it is not.
For example:
Or:
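The following sketch illustrates the general form with a hypothetical temperature value. It also shows that the result of if ... else can be assigned directly.

```r
temperature <- 12  # hypothetical value in degrees C

# General form: if (condition) process 1 else process 2
if (temperature > 20) print("warm") else print("cold")

# if ... else also returns a value, so the result can be assigned.
label <- if (temperature > 20) "warm" else "cold"
label
```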
If the processes contained in the condition require multiple lines, they should be written inside of curly brackets.
Note that the else statement is not always required:
An example of an else clause within an if statement is as follows:
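Both cases, with and without else, can be sketched as follows. The variable names and the rainfall thresholds are assumptions chosen for illustration.

```r
rain <- 0  # hypothetical daily rainfall in mm
message.text <- "no rain"

# Without else: nothing happens when the condition is FALSE.
if (rain > 0) {
  message.text <- "rain recorded"
}

# With else: the closing bracket and else must be on the same line
# when the code is run at the top level of a script or the console.
if (rain > 10) {
  intensity <- "heavy"
} else {
  intensity <- "light or none"
}
```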
Sometimes, we want to run a process if any one of a set of conditions is met.
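One way to express this is with the || (or) operator, or with the any function for a vector of conditions. The temperature limits below are hypothetical.

```r
temperature <- 35  # hypothetical value in degrees C

# || is TRUE if at least one of the conditions is TRUE.
if (temperature < 0 || temperature > 30) {
  status <- "extreme"
} else {
  status <- "normal"
}

# any() does the same for a logical vector of conditions.
checks <- c(temperature < 0, temperature > 30)
any(checks)
```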
The for loop has the form:
The variable will be iterated during the for loop taking each one of the values defined in the vector. The process will be applied in each iteration. For example:
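A minimal sketch of a for loop, assuming we want the squares of the numbers 1 to 5:

```r
# General form: for (variable in vector) process
squares <- numeric(5)        # pre-allocate the result vector
for (i in 1:5) {
  squares[i] <- i^2          # i takes the values 1, 2, ..., 5 in turn
}
squares
```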
Similarly to the if statement, if the process has more than one line of code, it should be surrounded by curly brackets.
If you write a loop within a loop, we refer to this as a nested loop.
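As an illustration, the sketch below uses one loop inside another to fill a hypothetical 3 x 4 matrix with the product of each row and column index.

```r
# Fill a 3 x 4 matrix with the product of the row and column indices.
m <- matrix(NA, nrow = 3, ncol = 4)
for (i in seq_len(nrow(m))) {      # outer loop over rows
  for (j in seq_len(ncol(m))) {    # inner loop over columns
    m[i, j] <- i * j
  }
}
m
```

The inner loop runs completely for every single step of the outer loop, so the assignment is executed 3 x 4 = 12 times.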
We can use a for loop in a spatial time series analysis.
The while loop has the form:
The condition is evaluated first; if it is true, the process is run. When the condition is no longer met, the while loop stops.
For example:
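A small sketch of a while loop, assuming we want to add up 1, 2, 3, ... until the sum reaches at least 20:

```r
# General form: while (condition) process
total <- 0
n <- 0
while (total < 20) {   # checked before every iteration
  n <- n + 1
  total <- total + n   # add 1, 2, 3, ... to the running sum
}
n       # 6
total   # 21
```

The variables used in the condition must change inside the loop; otherwise the condition never becomes false and the loop runs forever.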